This report explores how the quality of red wine is affected by its chemical components.
Let’s begin by exploring the structure of the data.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
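The output above looks like what `dim()`, `str()` and `table()` print for a data frame. A minimal sketch of those calls follows. It assumes the data frame is called `redwine` (the name that appears in later test output) and, because the report does not give the file path, it builds a toy frame from the first ten rows of three columns instead of reading the real file.

```r
# Toy stand-in for the real data: first ten rows of three columns,
# copied from the str() listing above. The real file path is not given.
redwine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(redwine)   # number of rows and columns
str(redwine)   # type and first values of each column

# Number of wines at each quality level.
quality_counts <- table(redwine$quality)
print(quality_counts)
```

On the full data set the same three calls would report 1599 rows, 13 columns and the six quality levels shown above.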
Description: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
## [1] "Fixed.Acidity Statistics:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity (tartaric acid) has a slight right tail, but overall I think it is approximately normally distributed, with a mean of 8.32 g/dm3 and a median of 7.9 g/dm3:
Description: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.
## [1] "Volatile.Acidity Statistics Summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The Volatile.Acidity histogram shows a right-skewed distribution. I believe volatile acidity values should be small, because the more acetic acid a wine contains, the more unpleasant it tastes; that would explain why most values are below 1.0 g/dm3. I wonder what the quality is of wines that have more than 1 g/dm3 of acetic acid?
Description: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
## [1] "Citric.Acid Statistics Summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Top 5 values for Citric Acid:"
##
## 0 0.49 0.24 0.02 0.26
## 132 68 51 50 38
The Citric.Acid histogram shows a right-skewed distribution with peaks around 0.0, 0.25 and 0.5 g/dm3. I expected most wines to contain citric acid, but it seems I was wrong: the single most common value is 0 g/dm3 (132 observations). Does this mean that quality decreases when a wine has no citric acid, since it might lack ‘freshness’ and flavor? We will see how wine quality is affected by this in the next sections.
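A "top 5 values" listing like the one above can be produced by tabulating the variable and sorting the counts. The sketch below is one way to do it, not necessarily the code behind the report; it uses only the first ten citric acid values from the `str()` listing as toy input.

```r
# Toy vector: the first ten citric.acid values from the str() listing.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Count how often each distinct value occurs, most frequent first.
value_counts <- sort(table(citric), decreasing = TRUE)
top_values <- head(value_counts, 5)
print(top_values)

# Share of observations with no citric acid at all.
zero_share <- mean(citric == 0)
```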
Description: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
The Residual.Sugar histogram shows a right-skewed distribution. It also shows that some red wines have less than 1 g/dm3 of sugar (2 observations have 0.9 g/dm3). Most wines have between 1.2 and 3.5 g/dm3 of residual sugar.
## [1] "Wines with less than 1g/dm3 of sugar amount:"
##
## 0.9
## 2
Description: the amount of salt in the wine.
## [1] "Chlorides Statistics Summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The Chlorides histogram shows a right-skewed distribution that peaks around 0.08 g/dm3. Most wines have a chloride amount between 0.03 and 0.125 g/dm3. The histogram also shows gaps around 0.3, 0.425 and 0.55, and the maximum chloride value is 0.611. I wonder how chlorides affect wine quality?
Let’s subset the data and look at the quality of wines with more than 0.35 g/dm3 of chlorides. None of these wines has a quality level of 8. I think the amount of chlorides might affect wine quality. We will try to figure that out in the next sections.
##
## 4 5 6 7
## 1 13 3 1
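The subsetting step described above can be sketched with `subset()` and `table()`. The six rows below are hypothetical toy values, not taken from the real data set (the first ten rows shown earlier contain no wine above the threshold); only the 0.35 g/dm3 cutoff comes from the text.

```r
# Hypothetical toy values, NOT taken from the real data set.
redwine <- data.frame(
  chlorides = c(0.076, 0.401, 0.092, 0.611, 0.369, 0.075),
  quality   = c(5L, 5L, 6L, 5L, 4L, 8L)
)

# Keep only wines above the chloride threshold used in the text.
salty <- subset(redwine, chlorides > 0.35)

# Tabulate the quality levels of the remaining wines.
salty_quality <- table(salty$quality)
print(salty_quality)
```

Quality levels that do not occur in the subset simply do not appear in the table, which is why the output above lists only levels 4 to 7.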
Description: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
## [1] "Free.Sulfur.Dioxide Statistics Summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The Free.Sulfur.Dioxide histogram shows a right-skewed distribution that peaks around 6 mg/dm3. The majority of wines have between 2.5 and 32.5 mg/dm3 of free sulfur dioxide. I want to explore the quality of wines that have more than 65 mg/dm3 of free sulfur dioxide. Do they have a high quality ranking?
Description: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
## [1] "Total.Sulfur.Dioxide Summary Statistics (g/dm3):"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00600 0.02200 0.03800 0.04647 0.06200 0.28900
The Total.Sulfur.Dioxide histogram shows a right-tailed distribution in which all values are less than 0.3 g/dm3. The distribution peaks around 0.028 g/dm3. The majority of wines have between 0.01 and 0.085 g/dm3 of total sulfur dioxide.
Description: the density of wine is close to that of water, depending on the percent alcohol and sugar content.
## [1] "Density Summary Statistics (g/cm3)"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The Density histogram shows an approximately normal distribution with a mean of 0.997 g/cm3. There are some gaps in both the right and left tails. I wonder how density affects wine quality? I believe the closer it gets to the density of water, the higher the quality.
Description: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
## [1] "PH Summary Statistics:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH histogram shows an approximately normal distribution with a mean of 3.3. Less than 1% of wines are close to being very acidic, and none of them is very basic on the pH scale.
Description: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
## [1] "Sulphates Statistics Summary (g/dm3):"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates have a right-tailed distribution that peaks around 0.6 g/dm3. The histogram also shows many gaps above 1.2 g/dm3. Less than 0.4% of red wine observations have more than 1.65 g/dm3 of sulphates.
Description: the percent alcohol content of the wine.
## [1] "Alcohol (% by volume) Summary Statistics:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The Alcohol histogram shows a positively skewed distribution with a mode of 9.5 and a mean of around 10.4 % by volume. The majority of wines have between 9 and 13 % alcohol. I think wines of good quality will have a high percentage of alcohol. Let’s see whether I am right in the next sections.
Description: the wine quality (score between 0 and 10)
## [1] "Quality Statistics Summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The Quality Percentile Chart shows that around 80% of red wines have a quality level of 5 or 6. Only about 1.1% of red wines (18 of 1599) have a quality level greater than 7, which is a very small share. Now I am wondering about the chemical components, and their amounts, in that small group of highest-quality red wines. What are the chemical properties that affect wine quality?
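Percentages of this kind can be derived directly from the quality counts printed near the top of the report. The sketch below hard-codes those six counts rather than reading the data; the variable names are illustrative.

```r
# Counts per quality level, copied from the table() output earlier in the report.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Convert counts to percentages of all wines.
quality_pct <- 100 * quality_counts / sum(quality_counts)
print(round(quality_pct, 2))

# Share of wines rated 5 or 6, and share rated above 7.
pct_5_or_6  <- sum(quality_pct[c("5", "6")])
pct_above_7 <- sum(quality_pct[as.integer(names(quality_pct)) > 7])
```

With these counts, levels 5 and 6 together cover roughly 82% of the wines, and level 8 alone covers a little over 1%.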
The data has 1599 red wine observations with 12 features (described in detail in the previous section).
After googling and exploring the data, I think the main features that might influence the quality of red wines are citric acid amount and alcohol percent.
Fixed acidity, volatile acidity, chlorides and residual sugar should also help in defining red wine quality, as they affect the taste characteristics of wine, which are very important.
Before I start plotting the data, let’s find the correlations between the different variables, just to be sure of the main features of interest:
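A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below is an assumption about the approach, since the report shows only the resulting plot; it uses the first ten rows of three columns from the `str()` listing as toy data, so its numbers will differ from those of the full data set.

```r
# Toy stand-in: first ten rows of three columns from the str() listing.
redwine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(redwine)
print(round(cor_matrix, 2))

# Absolute correlation of every variable with quality, strongest first.
quality_cor <- sort(abs(cor_matrix[, "quality"]), decreasing = TRUE)
```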
The above matrix shows the following:
Now, let’s test these associations and learn more about them.
I think we can say there is a moderate positive linear relationship between quality and alcohol percent (r ≈ 0.48 in the test below). The points form horizontal stripes because quality takes only integer values. As alcohol % increases, quality tends to increase.
##
## Pearson's product-moment correlation
##
## data: redwine$quality and redwine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
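The output above is the standard printout of `cor.test()`. As a sketch, the call can be reproduced on toy data (the first ten alcohol and quality values from the `str()` listing), and the reported t statistic can be checked by hand from the reported correlation and degrees of freedom using t = r·sqrt(df) / sqrt(1 − r²).

```r
# Toy stand-in: first ten alcohol and quality values from the str() listing.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

result <- cor.test(quality, alcohol, method = "pearson")
r       <- unname(result$estimate)  # sample correlation
ci      <- result$conf.int          # 95% confidence interval
p_value <- result$p.value
print(result)

# Hand check of the t statistic reported above (r = 0.4761663, df = 1597).
r_report  <- 0.4761663
df_report <- 1597
t_report  <- r_report * sqrt(df_report) / sqrt(1 - r_report^2)
```

The hand-computed value agrees with the reported t = 21.639 to within rounding.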
I’d like to see the mean alcohol % for each quality level.
It seems that the highest-quality wines have the highest alcohol % by volume. The chart shows some outliers that go against this pattern, for example among wines with a quality level of 5. The blue line shows how the mean alcohol % forms a roughly linear relation with the quality level, so I think our assumption still holds.
Statistics Summary:
## # A tibble: 6 x 4
## quality mean_alcohol_percent median_alcohol_percent total_count
## <dbl> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 10
## 2 4 10.265094 10.000 53
## 3 5 9.899706 9.700 681
## 4 6 10.629519 10.500 638
## 5 7 11.465913 11.500 199
## 6 8 12.094444 12.150 18
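A per-quality summary like the table above can be built in several ways. The tibble header suggests the original used dplyr, but that is a guess; the sketch below uses base R's `aggregate()` instead, on toy data taken from the first ten rows of the `str()` listing. The column names mirror those in the output above.

```r
# Toy stand-in: first ten alcohol and quality values from the str() listing.
redwine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One row per quality level with mean, median and count of alcohol.
by_quality <- aggregate(
  alcohol ~ quality, data = redwine,
  FUN = function(x) c(mean = mean(x), median = median(x), n = length(x))
)

# aggregate() stores the three statistics as one matrix column; flatten it.
by_quality <- do.call(data.frame, by_quality)
names(by_quality) <- c("quality", "mean_alcohol_percent",
                       "median_alcohol_percent", "total_count")
print(by_quality)
```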
The correlation coefficient is less than 0.3, so I think there is no strong association between citric acid amount and wine quality.
##
## Pearson's product-moment correlation
##
## data: redwine$quality and redwine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
After adding jitter:
From the chart above, the mean citric acid amount (g/dm3) increases as we move from the lowest wine quality level to the highest. The differences are not large, but they still show a slight linear relationship.
Citric acid amount and wine quality statistics summary:
## Quality Level: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## [1] "-------------------------------------------------------------------"
## Quality Level: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## [1] "-------------------------------------------------------------------"
## Quality Level: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## [1] "-------------------------------------------------------------------"
## Quality Level: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## [1] "-------------------------------------------------------------------"
## Quality Level: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## [1] "-------------------------------------------------------------------"
## Quality Level: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
## [1] "-------------------------------------------------------------------"
## NULL
The majority of zero citric acid wines belong to the average quality of red wines. To see this, I will subset the wines with zero citric acid to see their distribution.
## Quality Freq
## 1 3 3
## 2 4 10
## 3 5 57
## 4 6 54
## 5 7 8
## Quality Freq Rel.Freq
## 1 3 3 2.27
## 2 4 10 7.58
## 3 5 57 43.18
## 4 6 54 40.91
## 5 7 8 6.06
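The relative frequencies in the second table follow directly from the counts in the first. A short sketch, with the counts hard-coded from the output above:

```r
# Counts of zero-citric-acid wines per quality level, copied from the output above.
zero_citric <- data.frame(Quality = c(3, 4, 5, 6, 7),
                          Freq    = c(3, 10, 57, 54, 8))

# Relative frequency in percent, rounded to two decimals.
zero_citric$Rel.Freq <- round(100 * zero_citric$Freq / sum(zero_citric$Freq), 2)
print(zero_citric)
```

The counts sum to 132, the number of wines with 0 g/dm3 of citric acid, and levels 5 and 6 together account for about 84% of them.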
The graph below shows a moderate negative association between wine quality and volatile acidity amount.
The lowest-quality wines (level 3) include the maximum volatile acidity value (1.58 g/dm3) and have the highest median (0.845 g/dm3), while the highest-quality wines (levels 7 and 8) have the lowest median (0.37 g/dm3). That confirms what was mentioned earlier: an increase in volatile acidity might lead to an unpleasant taste.
Volatile Acidity Amount and wine quality statistic summary:
## Quality Level: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## [1] "-------------------------------------------------------------------"
## Quality Level: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## [1] "-------------------------------------------------------------------"
## Quality Level: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## [1] "-------------------------------------------------------------------"
## Quality Level: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## [1] "-------------------------------------------------------------------"
## Quality Level: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## [1] "-------------------------------------------------------------------"
## Quality Level: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
## [1] "-------------------------------------------------------------------"
## NULL
##
## Call:
## lm(formula = quality ~ volatile.acidity, data = subset(redwine,
## volatile.acidity <= quantile(redwine$volatile.acidity, 0.999)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.78977 -0.54547 -0.01325 0.47198 2.92568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.55757 0.05841 112.27 <2e-16 ***
## volatile.acidity -1.74500 0.10503 -16.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7436 on 1596 degrees of freedom
## Multiple R-squared: 0.1474, Adjusted R-squared: 0.1469
## F-statistic: 276 on 1 and 1596 DF, p-value: < 2.2e-16
Building a linear model with volatile.acidity as the only predictor gives a small R-squared (about 0.15), so I think knowing the volatile acidity amount alone is not adequate to predict wine quality.
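The model call shown in the output above trims the top 0.1% of volatile acidity values before fitting. The sketch below repeats the same steps on toy data (the first ten rows from the `str()` listing), so the coefficients and R-squared will not match the reported ones; it only illustrates where each number in the summary comes from.

```r
# Toy stand-in: first ten rows of two columns from the str() listing.
redwine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Drop the most extreme volatile acidity values (above the 99.9th percentile).
cutoff  <- quantile(redwine$volatile.acidity, 0.999)
trimmed <- subset(redwine, volatile.acidity <= cutoff)

# Simple linear regression of quality on volatile acidity.
fit <- lm(quality ~ volatile.acidity, data = trimmed)

# Share of the variance in quality explained by the single predictor.
r_squared <- summary(fit)$r.squared
print(summary(fit))
```

In the report's output, an R-squared of 0.1474 means that volatile acidity alone accounts for roughly 15% of the variance in quality.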
##
## Pearson's product-moment correlation
##
## data: redwine$quality and redwine$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
There is no strong association between wine density and its quality.
Density and wine quality statistics summary:
## Quality Level: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9962 0.9976 0.9975 0.9988 1.0010
## [1] "-------------------------------------------------------------------"
## Quality Level: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9956 0.9965 0.9965 0.9974 1.0010
## [1] "-------------------------------------------------------------------"
## Quality Level: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0030
## [1] "-------------------------------------------------------------------"
## Quality Level: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0040
## [1] "-------------------------------------------------------------------"
## Quality Level: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0030
## [1] "-------------------------------------------------------------------"
## Quality Level: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
## [1] "-------------------------------------------------------------------"
## NULL
The graph below goes against my intuition. It shows only a weak-to-moderate association between residual sugar and density (r ≈ 0.36 in the test below). Most wines have between 1.5 and 3 g/dm3 of residual sugar, so I can’t say that the variance in our samples’ density is mainly due to changes in sugar amount.
##
## Pearson's product-moment correlation
##
## data: redwine$residual.sugar and redwine$density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3116908 0.3973835
## sample estimates:
## cor
## 0.3552834
There is a moderate-to-strong positive association between citric acid amount and fixed acidity (r ≈ 0.67 in the test below). The graph below shows that fixed acidity increases as citric acid amount increases.
##
## Pearson's product-moment correlation
##
## data: redwine$fixed.acidity and redwine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
Yes: there is a positive correlation between citric acid amount and fixed acidity. Density also correlates strongly, and negatively, with alcohol percent by volume.
There is a negative association between alcohol % by volume and density (g/cm3). Wines of average quality have a density between 0.995 and 1 g/cm3 and less than 11% alcohol by volume. As alcohol percent increases, wine quality increases and density decreases.
The majority of wines of average quality (levels 5 & 6) tend to have a small citric acid amount, less than 0.5 g/dm3. Wines of quality level 8 show a clear linear relationship between citric acid amount and density.
The density chart above confirms the negative correlation between volatile acidity and quality. The highest-quality wines occur more often at lower values of volatile acidity, while the lowest-quality wines tend to contain higher amounts of volatile acidity.
The graph above shows that the highest-quality wines tend to occur where volatile acidity is less than 0.6 g/dm3 and citric acid is between 0.25 and 0.6 g/dm3, while the lowest-quality wines tend to have less citric acid (less than 0.125 g/dm3) and more volatile acidity. Wines of average quality tend to occur where volatile acidity is between 0.2 and 0.8 g/dm3 and citric acid is less than or equal to 0.5 g/dm3. The yellow line shows the negative association between volatile acidity and citric acid amount.
Based on the previous analysis, we can conclude that the red wine quality is strongly affected by the following variables:
The majority of the highest-quality wines tend to have lower amounts of volatile acidity, while the majority of the lowest-quality wines tend to have higher amounts.
The graph above shows how quality is affected by alcohol %. There is a clear positive relation, represented by the black line (the mean alcohol value for each quality level).
Volatile acidity and alcohol % by volume are good predictors of red wine quality. The graph above shows how each quality level of red wine is characterized by those two variables. As mentioned before, red wines of the highest quality tend to have a small amount of volatile acidity and a high amount of alcohol compared to the average.
There was no strong correlation between the dependent and independent variables. There were outliers in each quality level, which might have affected the analysis. Our data sample doesn’t include data for all quality levels (1 to 10), so we don’t know whether our findings apply to wines with a quality level of 10 or 1.
By visualizing bivariate and multivariate plots, we had the chance to see how wine quality is affected by the amounts of the chemical components. Our final findings showed that wine quality is clearly affected by volatile acidity, citric acid amount and alcohol percent by volume. The analysis also confirmed our intuition that an increase in volatile acidity might decrease wine quality because of the unpleasant vinegar taste it can cause. The highest-quality wines tend to have a higher alcohol percent, lower volatile acidity and somewhat higher citric acid amounts.
I think the analysis could be enriched if we had more data for the missing quality levels, and more categorical variables that describe the wine. I also think we could convert some numeric variables to categorical variables by mapping their values into tiers. Take pH as an example: any wine is on a scale from 0 (very acidic) to 14 (very basic), and most wines are between 3 and 4. We could map those values into 3 or 4 categories, such as: 0 to 3, very acidic; above 3 and up to 4, moderately acidic; above 4 and up to 8, less acidic; above 8, tending to be very basic. The analysis could also be enriched by checking the relationships between all independent variables, as we might find some interesting combinations that affect wine quality.
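The pH tiers proposed above could be implemented with base R's `cut()`. This is only a sketch of the idea: the tier labels are illustrative, the upper edge of 14 is assumed from the pH scale, and the input is the first ten pH values from the `str()` listing near the top of the report.

```r
# Toy stand-in: first ten pH values from the str() listing.
pH <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Bin edges follow the tiers proposed above; intervals are (lower, upper],
# with include.lowest = TRUE so that a pH of exactly 0 is also covered.
breaks <- c(0, 3, 4, 8, 14)
labels <- c("very acidic", "moderately acidic", "less acidic", "very basic")

pH_tier <- cut(pH, breaks = breaks, labels = labels,
               include.lowest = TRUE, ordered_result = TRUE)
print(table(pH_tier))

# The lowest pH in the full data set (2.74) would land in the first tier.
lowest_tier <- as.character(cut(2.74, breaks = breaks, labels = labels,
                                include.lowest = TRUE))
```

Since almost all wines in this data set have a pH between 3 and 4, these particular tiers would put nearly every wine in the same category; narrower edges within that range might be more informative.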